Method and apparatus for alignment of signals for use in DNA base-calling

ABSTRACT

Data traces from four channels of an automated electrophoresis detection apparatus are aligned by identifying peaks in each of the four data traces; optionally normalizing the data traces to achieve a uniform peak height; combining the four data traces in an initial alignment; and determining coefficients of shift and stretch for selected data points within each data trace. The coefficients are determined by optimizing a cost function which reflects the extent of overlap of peaks in the combined normalized data traces to which the coefficients have been applied. The cost function is optimized when the extent of overlap is at a minimum. The coefficients are then used to generate a warp function for each data trace. These warp functions are applied to their respective data traces to produce four warped data traces which are aligned to form an aligned data set. The aligned data set may be displayed on a video screen of a sequencing apparatus, or may be used as the data set for a base-calling process.

This application is a national phase of International Application No.PCT/CA97/00463, which is a continuation-in-part of U.S. patentapplication Ser. No. 08/670,534, filed Jun. 27, 1996, which is now U.S.Pat. No. 5,916,747.

DESCRIPTION Background to the Invention

This invention relates to a method of processing output signals from anautomated electrophoresis detection apparatus, and to an apparatus whichemploys this method for sequencing nucleic acids.

One of the steps in nucleotide sequence determination of a subjectnucleic acid molecule is interpretation of the pattern of nucleic acidfragments which results from electrophoretic separation of fragments, orreaction products, of a DNA sequencing reaction (the “fragmentpattern”). The interpretation, colloquially known as “base calling”,involves determination from the recorded fragment pattern of the orderof four nucleotide bases, A (adenine), C (cytosine), G (guanine) and T(thymine) for DNA or U (uracil) for RNA in the subject nucleic acidmolecule.

The chemistry employed for a DNA sequencing reaction using the dideoxy(or chain-termination) sequencing technique is well known, and was firstreported by Sanger et al. (Proc. Natl. Acad. Sci. USA 74: 5463-5467(1977)). Four samples of nucleic acid fragments (terminating in A, C, G,or T(U) respectively in the Sanger et al. method) are loaded at aloading site at one end of an electrophoresis gel. An electric field isapplied across the gel, causing the fragments to migrate from theloading site towards the opposite end of the gel. During thiselectrophoresis, the gel acts as a separation matrix. The fragments,which in each sample are of an extended series of discrete sizes,separate into bands of discrete species in a lane along the length ofthe gel. Shorter fragments generally move more quickly than largerfragments.

If the DNA fragments are labeled with a fluorescent label, an automatedelectrophoresis detection apparatus (also called a “DNA sequencer”) canbe used to detect the passage of migrating bands in real time. Existingautomated DNA sequencers are available from Applied Biosystems, Inc.(Foster City, Calif.), Pharmacia Biotech. Inc. (Piscataway, N.J.),Li-Cor, Inc. (Lincoln, Nebr.), Molecular Dynamics, Inc. (Sunnyvale,Calif.) and Visible Genetics Inc. (Toronto). Other methods of detection,based on detection of features inherent to the subject molecule, such asdetection of light polarization as disclosed in U.S. Pat. No. 5,543,018which is incorporated herein by reference, are also possible.

A significant problem in determining a DNA sequence, encounteredparticularly with high speed DNA sequencing and in sequencing apparatuswhich do not combine the four sets of sequencing reaction products in asingle lane, is alignment of data signals from the four different outputchannels of an automated DNA sequencing apparatus. Once data is alignedproperly, it is relatively straight-forward to base-call it. However,this initial step can be very challenging since the output signal may beerratically shifted and/or stretched as a result of chemistry and gelanomalies. A reliable method of aligning data, that can produce datawhich takes into account non-linear shifting and stretching of signaloutput, is highly desirable particularly for high-speed DNA sequencing.

Existing prior art determinants in this field are very limited. Existingautomated sequencers traditionally operate at voltages low enough thatnon-linear shifting is avoided. The use of low voltages, however, limitsthe speed with which separation of sequencing fragments into discretebands can be accomplished.

Published methods of computer assisted base calling include the methodsdisclosed by Tibbetts and Bowling (U.S. Pat. No. 5,365,455) and Dam etal (U.S. Pat. No. 5,119,316) which patents are incorporated herein byreference. Both patents assume alignment of output signals and addressonly aspects of base-calling from the aligned signals.

It is an object of the present invention to provide a method of aligningreal-time signals from the output channels of an automatedelectrophoresis apparatus.

It is a further object of the invention to provide an improved method ofbase-calling an DNA signal sequence aligned according to the invention.

It is still a further object of the invention to provide an apparatusfor sequencing nucleic acids which utilizes the improved method inaccordance with the invention for aligning real-time signals from theoutput channels of an automated electrophoresis apparatus.

SUMMARY OF THE INVENTION

These and other objects of the invention are achieved using a method foraligning data traces from four channels of an automated electrophoresisdetection apparatus, each channel detecting the products of one of fourchain-termination DNA sequencing reactions, whereby said four channelstogether provide information concerning the sequence of all four baseswithin a nucleic acid polymer being analyzed, comprising the steps of:

(a) identifying peaks in each of the four data traces;

(b) normalizing the height of said peaks in each of said data traces toa common value to generate four normalized data traces if the peaks arenot of substantially equal height;

(c) combining the four normalized data traces in an initial alignment;

(d) determining coefficients of shift and stretch for selected datapoints within each normalized data trace, said coefficients optimizing acost function which reflects the extent of overlap of peaks in thecombined normalized data traces to which the coefficients have beenapplied, said cost function being optimized when the extent of overlapis at a minimum;

(e) generating warp functions for the normalized data traces from thecoefficients of shift and stretch determined for the selected datapoints;

(f) applying the warp functions to the respective data trace ornormalized data trace to produce four warped data traces; and

(g) assembling the four warped data traces to form an aligned data set.

The aligned data set may be displayed on a video screen of a sequencingapparatus, or may be used as the data set for a base-calling process.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the alignment process of the invention in flow chart form;

FIG. 2 illustrates a preliminary (unaligned) signal recorded from anautomated DNA sequencing apparatus over 15 seconds;

FIG. 3 illustrates the detection of peaks from a preliminary signal;

FIG. 4 illustrates a normalized data stream:

FIGS. 5A and 5B illustrate a normalization method employed for peaks andvalleys;

FIGS. 6A and 6B illustrate the preliminary alignment of four normalizeddata streams representing time based alignment of each of the A, C, Gand T output channels;

FIG. 7 illustrates alignment of the 4 normalized data streams withrespect to minimization of the “cost” function;

FIGS. 8A and 8B shows exemplary warp functions for data points in the C,G and T traces relative to the A trace;

FIG. 9 illustrates a method for determining a standard gaussian peakwidth;

FIG. 10 shows an example of the classification of singletons using thefeatures of peaks in the detected signal;

FIG. 11 illustrates a base-calling method useful with the aligned datatraces; and

FIG. 12 shows an apparatus in accordance with the invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention provides a method of aligning data traces whichcan be used to align data traces from an automated electrophoresisdetection apparatus for use in base-calling. In accordance with theinvention, an optimization routine such as a simulated annealingalgorithm is used to determine coefficients of stretch and shift toalign data traces, each representing one of the four chain-terminationDNA sequencing reactions. With high quality data trace alignment,base-calling may proceed with a high degree of accuracy andrepeatability.

“Data trace” as used in the specification and claims of this applicationrefers to the series of peaks and valleys representing the migratingbands of oligonucleotide fragments produced in one chain terminationsequencing reaction and detected in a DNA sequencer. Such data tracesare sometimes referred to as sequence chromatograms or a chromatographictrace. The data trace may be either a raw data trace or a “conditioned”data trace.

“Shift” as used in the specification and claims of this application isinterchangeable with “offset” and refers to the number of data pointswhich the signal output is displaced, either positively or negativelyfrom its original position.

“Stretch” as used in the specification and claims of this applicationrefers to increase or decrease in spacing between data points of asignal output relative to the original spacing. The stretch may beconstant across a segment of data or may follow a second or higher orderpolynomial.

A “warp function” as used in the specification and claims of thisapplication is an instantaneous representation of optimized shift andstretch at each data point. The warp function can be representedgraphically either as a data point-for-data point plot for one datatrace versus a standard or second data trace, or as a plot ofdisplacement as a function of data point number. Standard traces may bea trace representing an average of multiple experimental runs, a baseseparation function, or a standard derived from a text sequence. Thewarp function is applied to a data trace to obtain a “warped datatrace.”

FIG. 1 shows the alignment process of the invention in flow chart form.As shown, the first step in the method is the identification of peaks ineach of several (generally four) data traces. In some cases, thesetraces are next normalized to a constant height. This step may beunnecessary if the DNA Sequencer and chemistry used producesubstantially uniform peak sizes (homozygous peak heights within about10%) or if the traces are nearly aligned from the sequencer. The term“normalized data traces” refers to data traces having this level ofconsistency, whether as a result of normalization step or as a result ofsufficient initial uniformity. Normalization may also be unnecessary,even for non-uniform peak sizes, if the cost function used in thealignment is independent of peak height and area.

The data traces are then combined in an initial alignment and dividedinto a plurality of segments or windows. For each segment, optimizedstretch and shift values are determined using an iterative optimizationroutine which converges towards a “best” solution for alignment of thedata traces. The quality of the various trial alignment is assessed forcomparison purposes using a cost function. Various cost functions can beemployed including (1) a cost function which minimizes overlap of thedata traces combined within the segment, and (2) a cost function whichmaximizes the regularity of peak distribution the segment.

Once a “best” solution has bene found, the shift and stretch values areused to define warp functions. These warp functions are applied to theindividual data traces to produce warped data traces which are assembledas aligned data useful for base-calling.

To illustrate the use of this method, FIG. 2 shows a typical raw datatrace for one of the four chain-termination DNA sequencing reactions, asdetected on a Visible Genetics Inc. MicroGene Blaster™. The X-axisrepresents time while the Y axis represents fluorescence detection. Thedata trace reveals a series of bands of fluorescent molecules passingthrough the detection site, as expected from a typical chain-terminationDNA sequencing reaction.

FIG. 2 shows several of the features which complicate the use of rawdata for base-calling, and illustrates the need for the presentinvention to provide appropriately aligned signals for this purpose. Inparticular, as reflected in FIG. 2, the fluorescence intensity can varyfrom one band to another. In addition, not all bands are fully resolved,and the spacing between adjacent bands is not always an integralmultiple of the theoretical spacing between adjacent bands. The presentinvention provides a method for converting this raw signal, and itsthree counterpart signals for the other three sequencing reactions, intoaligned data which is highly suitable for base-calling.

The data trace which is processed in accordance with the method of theinvention is preferably a signal collected using the fluorescencedetection apparatus of an automated DNA sequencer. However, the presentinvention is applicable to any data set which reflects the separation ofoligonucleotide fragments in space or time, including real-time fragmentpatterns using any type of detector, for example a polarization detectoras described in U.S. Pat. No. 5,543,018; densitometer traces ofautoradiographs or stained gels, traces from laser-scanned gelscontaining fluorescently-tagged oligonucleotides: and fragment patternsfrom samples separated by mass spectrometry.

In the method of the invention, four data traces, one for eachsequencing reaction, are normalized as needed to correct for variationsin peak height using the procedure described below. The normalized datatraces are then used to determine a series of stretch and shiftcoefficients, which are then applied to the normalized data traces toarrive at aligned data traces.

Prior to normalizing and aligning the data traces for the foursequencing reactions using the method of the invention, however, it maybe advantageous to condition the signal, although this step is notrequired. This conditioning can be done, for example, using conventionalbaseline correction and noise reduction techniques to yield a“conditioned” data trace. As is known in the art, three methods ofsignal processing commonly used are background subtraction, lowfrequency filtration and high frequency filtration, and any of these maybe used, singly or in combination to produce a conditioned signal to beused as a conditioned data trace in the method of the invention.

Preferably, the data is conditioned by background subtraction using anon-linear filter such as an erosion filter, with or without a low-passfilter to eliminate systemic noise. The preferred low-pass filtrationtechnique is non-causal gaussian convolution.

After any needed conditioning of the data trace is performed, the datatrace is normalized as needed to generate a “normalized data trace”which is used to determine coefficients of stretch and shift forbase-calling. The normalization process includes the following steps.

Firstly, the data trace (raw or conditioned) is searched for peaks.Peaks can be identified as the middle data point of three consecutivedata points wherein the inside data point is higher than the two outsidedata points. FIG. 3. More sophisticated methods of peak detection arealso possible. For example, a preferred method involves using the“three-point” method to segment the data trace, and then joining thesegments. A trace feature is assigned as an actual peak whenever thedifference between a maximum and an adjacent minimum exceeds a thresholdvalue, e.g., 5%. A minimum peak height from the base-line may also berequired to eliminate spurious peaks.

An exception is made for the so-called “primer peak” and “terminationpeak” which are found in some variations of the chain-terminationsequencing method. These peaks comprise a large volume of unreactedprimer, which tends to interfere with base-calling around the shorterchain-extension products, and a large volume of the complete sequencewhich may interfere with base-calling around the longest chain-extensionproducts. These peaks are identified and eliminated from considerationeither on the basis of their size, their location relative to the startand end of the electrophoresis process, or some other method.

After elimination of the primer and termination peaks, the data trace isnormalized so that all of the identified peaks have the same heightwhich is assigned a common value, e.g., 1. (FIG. 4). This processreduces signal variations due to chemistry and enzyme function, andworks effectively for homozygous samples and for many heterozygoteshaving moderate, i.e., less than about 5 to 10%, heterozygosity in a 200base pair or larger region being sequenced. It will be appreciated thatwhen the data trace has substantially uniform peak height, normalizationcan be omitted.

To normalize the data trace, the points between each peak are assigned anumerical height value based on their position in the data tracerelative to a hypothetical line joining consecutive peaks and the baseline of the signal. Thus, as shown in FIG. 5A, the valley between peaks1 and 2 has a minimum at a point which is approximately 25% of thedistance from the baseline B to the line C joining peaks 1 and 2. Theminimum of this valley is therefore assigned a value of about 0.25. (SeeFIG. 5B). Similarly, the valley between peaks 2 and 3 has a minimum at apoint which is approximately 80% of the distance from the baseline B tothe line D joining peaks 2 and 3. The minimum of this valley istherefore assigned a value of about 0.8 in the normalized data trace.

The next step in the method of the invention is the determination ofshift and stretch coefficients for a set of four normalized data traces,one for each sequencing reaction. This is accomplished by combining thefour normalized data traces as shown diagrammatically in FIGS. 6A and6B, and determining coefficients of shift and stretch for selected datapoints within each normalized data trace which optimize a “cost”function. The cost function generally reflects the suitability of atrial alignment resulting from the application of the trial stretch andshift coefficients to the combined data traces. One type of costfunction evaluates the extent of overlap of peaks in the combinednormalized data traces to which the coefficients have been applied, andis considered optimized when the extent of overlap is at a minimum. Itwill be understood that the terms “best solution,” “optimized” and“minimum” as used herein do not require absolute optimization to anabsolute minimum, which could require unreasonably long periods ofanalysis time, but only require a practical level of optimizationsufficient to achieve satisfactory alignment of the data traces forbase-calling.

In one embodiment, the cost function measures the total area above thecombined normalized data traces, i.e., the dotted area in FIG. 5B. Thenormalized data traces are then shifted and stretched in an effort tominimize the value of “cost.” In a second embodiment, “cost” is setequal to the area below the combined normalized data traces, and thedata traces are then shifted and stretched to maximize the value of cost(i.e. reducing the overlap of the peaks). However, it has been foundempirically that this latter approach emphasizes less valuable featuresof the data traces than using the area above the curves as the costfunction.

In another embodiment, the cost function combines the total area abovethe combined data traces (the dotted regions in FIG. 6B) with the areabelow the highest edge formed by the combined data traces and above thesecond highest edge of the combined data traces (the lined regions inFIG. 6B). The cost function is at a minimum when the first of theseareas is minimized and the second of these areas is maximized. When thecoefficients which produce the minimum cost are applied to thenormalized data traces, an aligned normalized data set as shown in FIG.7 results.

Where the likely sequence of the DNA being analyzed is known, forexample in repetitive diagnostic applications, the cost function used inthe method of the invention may also compare the experimental datatraces to a set of model data traces. In this case, a suitable costfunction is the area between the experimental data trace and acorresponding data trace from a set of model data traces that have beenprealigned. The cost function is optimized when the area between thetraces is at a minimum. Standard sets of model data traces comprised ofrandom distributions of bases might also be used in such a comparison,although this increases the cost function space.

Another form of cost function that can be used in the method of theinvention evaluates the regularity of peak spacing achieved by the trialalignment. Such cost functions do not require normalization, becausethey do not depend on peak height or area. An example of such a functionwould be a function determining the standard deviation of the distancebetween adjacent peaks, excluding obvious outliers, with a low value forthe cost function being desirable. Another example of a cost function ofthis type determines the distance between each peak and the closest baseposition within a postulated regular array of peak spacings. The costfunction is optimized when the standard deviation of these distances forall of the peaks is at a minimum.

The postulated peak spacing in this array is tested over a several trialvalues established for a specific instrumental and experimentalconfiguration, thus adding an additional dimension to the cost functionspace. For example, where it is found that the normal peak spacing is 7data points, trial peak spacings of 3 to 9 data points are suitablytested for each trial alignment to determine the lowest value of thecost function.

It will further be appreciated that the various cost functions discussedabove can be used individually or in combination since they all worktowards a common goal, the optimum alignment of the data traces. Thus,alternative cost functions can be created as the product or dividendthat results when two or more of the cost functions are combined.

Because the optimization of the stretch and shift coefficients is aniterative process involving the testing of many combination of trialvalues, when the cost space is too large reasonably to permitfull-sampling of the cost space, it is desirable to use an optimizationroutine which facilitates convergence towards an acceptable set ofcoefficients. It will be appreciated by persons skilled in the art thatthere are many types of optimization routines using random or directedsparse sampling techniques which might be employed, including geneticalgorithms and Monte Carlo techniques. A preferred method fordetermining the coefficients of stretch and shift that yield the optimumvalue of “cost” employs “simulated annealing.”

Simulated annealing is a mathematical method of searching a broadparameter space for the “best fit” result, without having to test everymember of the parameter space. This method is particularly relevant tosignal alignment problems in high speed DNA sequencers because theparameter space for possible coefficient of stretch and shift needs tobe extremely broad to accommodate the unpredictability and variationwithin each electrophoresis run.

The preferred manner of performing a simulated annealing calculationemploys a computer. Many computer algorithms employing simulatedannealing are known and available to those skilled in the art. Ofparticular interest are papers by Ingber et al.:

Ingber, A. L., “Very fast simulated re-annealing,” J Mathl. Comput.Modelling 12(8): 967-973 (1989);

Ingber, A. L. et al., “Genetic algorithms and very fast simulatedreannealing: A comparison,” J Mathl. Comput. Modelling 16 (11):87-100(1992);

Ingber, A. L., “Simulated annealing: Practice versus theory,” J Mathl.Comput. Modelling 18(11):29-57 (1993); and

Ingber, A. L., “Adaptive simulated annealing (ASA): Lessons learned, ” JControl and Cybernetics 25(1):33-54 (1996).

Each of these papers is incorporated herein by reference

As employed in the present invention, simulated annealing determinescoefficients of stretch and shift for signal outputs from a VisibleGenetics MicroGene Blaster™ as follows. Consider a window of normalizedsignal output. The normalized data trace consists of a series of datapoints generated every 0.5 seconds. Each peak consists of about 6 to 7data points, and requires 3 to 3.5 seconds to pass through the detectionzone. For convenience of illustration, the window shown represents about90 data points or about 15 peaks. In a preferred embodiment, however,windows of 180 to 350 data points, representing about 30 to 50 peaks,most preferably of about 250 data points representing about 40 peaks areused.

One window is created for data from each of the four data traces, andthe windows would be initially aligned on the basis of data-pointnumber. Superimposing the four windows reveals a non-minimized “cost”result, that is, the amount of “cost” area is greater than it could be.

It is found that a first order equation can be applied to each point ofa data trace within the window to modify its position and change thecost area:

Y=mX+b

where Y=the new position of point X, b=offset (or shift), and m=stretch.A second order or higher equation (with coefficients in addition tostretch and shift being determined) is apparently not necessary toobtain satisfactory results, although second or higher order equationsmay be used for more sophisticated analyses.

The parameter space for b and m is empirically selected. The offsetcoefficient (b) usually falls within 30 data points of the initialtime-based alignment. Offset steps of 0.35 data points are suitablyemployed, thus providing a parameter space of 200 choices. For MicroGeneBlaster™ data, this offset represents about 5 data peaks. The stretchcoefficient (m) usually falls within 5%. These steps are suitably madein 0.66% amounts, thus requiring 15 steps to cover the whole range. Thetotal range of parameters for stretch and shift for each signal outputis therefore 3000 (200*15). The range of parameters for alignment ofthree channels with respect to the fourth channel is therefore 27billion (3000³).

For each combination of the six parameters tested the coefficient areapplied to the three signal output functions and the functions arestretched and/or shifted accordingly. The adjusted functions aresuperimposed with the fourth signal output and the “cost” area isrecalculated. The cost will either be lower, higher or the same asbefore.

Efficient selection of the six parameters is crucial for the discoveryof the parameters which provide the lowest cost. The simulated annealingtheory selects parameters for testing according to a variation of theMonte Carlo search technique “Boltzmann Annealing” known as AdaptiveSimulated Annealing (or Very Fast Simulated Re-Annealing). SimulatedAnnealing code is generally available to those skilled in the art overthe Internet at http:H/www.ingber.com. The code provides operationalsteps for rapidly searching a large parameter space for an optimalsolution given a cost function. An explanation of Simulated Annealing isfound in Ingber, A. L. “Very Fast Simulated Re-Annealing” J. Math.Comput. Modeling (1989) 12:967-973.

The simulated annealing technique employed in the invention usesalgorithms which are well-known to those skilled in the mathematicalarts in the following novel fashion. Coefficients of stretch and shiftare at first randomly selected and applied to the normalized datatraces. The cost is calculated. New coefficients are then selectedwithin a range defined by the annealing schedule (or temperaturefunction “T”) which governs the amount by which coefficients may bechanged with each trial. The new coefficients are applied and cost isagain determined. If the value of cost is lower than before, then thenew point is used as the starting point for the next calculation. If thevalue of cost is higher, then the original coefficients are usually usedagain as the starting point of the next calculation. As in all simulatedannealing processes, however, there is a finite probability P (initiallyon the order of about 20% or less) that the higher cost value will beused as the starting point. As the number of calculated values increase,T and P are reduced, thus tending to localize the search space around anarea of low cost. Annealing temperature schedules allow the“temperature” parameter to be raised to a higher value again atintervals during the search, emulating the process of annealing used toheat treat metals. Eventually, when the search is fully completed,simulated annealing theory argues that the lowest cost value parameterswill be found. In the above method, approximately 5000 sets ofparameters are tested per calculation, representing 0.00001% of theavailable parameter field

In practice, the “fast annealing” modification of Ingber is found to besatisfactory to obtain cost values low enough to base-call data from theVisible Genetics MicroGene Blaster™ (see Ingber, A. L. “Very FastSimulated Re-Annealing” J. Mathl. Comput Modeling (1989) 12:967-973).

After the successful determination of the best fit parameter set for agiven window of data points, the next window of data is analyzed. Thenext window is selected to be the same number of data points as thefirst window, with an overlap of about 50% with the first window. Again,the coefficients of shift and stretch are,identified which provide thelowest cost value when applied to the signal output functions. Thus,coefficients are determined for a series of piecewise domains, e.g.,piecewise linear or cubic domains. The process of selecting windows andcalculating coefficients continues until all the data has been analyzed.

When calculating the coefficients of stretch and shift for windows afterthe first window, it is sometimes advantageous to use the coefficientsfrom a neighboring window as the starting point for the simulatedannealing process since the coefficients for neighboring windows tend tobe related. In such cases, the annealing schedule T can be much shorter,for example testing only 1500 sets of coefficients as opposed to 5000.In addition, because it is desirable that the warp functions generatedbe continuous, subsequent windows may in fact be evaluated as two“sub-windows.” In the first sub-window, stretch coefficients areconstrained such, that the warp function does not change the offsetalready established at the center of the previous full window. In thesecond sub-window, the stretch is allowed to vary in a narrow range.

Ultimately, by this process of sliding a window in overlapping stepsacross the combined normalized data, a “warp function” is arrived at foreach normalized data trace. This function reflects the relationshipbetween optimum-cost shift values for each window. By connecting thedetermined points of the function, a curve is defined which givescoefficients of shift for each point in a data trace, and reflectsstretch at each location by the slope of the curve.

As a general matter, warp functions are generated in this way for eachdata trace. Thus, if the coefficients are determined with reference to afixed standard, four different warp functions reflecting the varyingcoefficients needed to align the four data traces are generated. Inpractice, however, it will be understood that the warp functions may bedetermined relative to one of the four data traces. In this case, thecoefficients for the one fixed data trace will all be one. In thecontext of this application, the phrase “generating a warp function foreach normalized data trace” encompasses both of these embodiments. Thewarp function may be represented by a plot of alignment of data pointsof the three warped traces, e.g., C, G and T against A as shown in FIG.8A or as a plot of displacement from the A trace versus data point asshown in FIG. 8B.

Each warp function is applied to its respective raw or conditioned datatrace to adjust the alignment of the data trace and generate a “warpeddata trace.” The four warped data traces are then combined in alignmentto produce an aligned data set.

An additional peak spacing warp function may be generated and applied toadjust for variations in peak spacing as part of creating the warpeddata traces. The peak spacing warp function, and also the baseseparation function useable as a standard trace, are generated bytesting postulated peak separation values in successive windows of thedata trace and minimizing a peak separation cost function for eachwindow. A suitable cost function is

COST=Σd ²

where d is the distance between each actual peak and a hypothetical peaklocated at the position fixed by the postulated peak separation value.

Presentation of the aligned data set may be done internally within acomputer for use with base-calling functions, or it may involve displayof the aligned data set on a video monitor. Either way, the presentationallows further use to be made of the modified output signals, forbase-calling and other purposes. For example, the video display ofaligned data may be useful to permit an operator to make manualadjustments, and to observe inaccuracies in base-calling.

Base-calling on the aligned data set may be performed in a variety ofways, including those base-calling techniques described in Tibbetts andBowling (U.S. Pat. No. 5,365,455) and Dam et al (U.S. Pat. No.5,119,316). Two preferred approaches to base-calling are described indetail below.

In the first approach for base-calling, peaks in each warped data tracemaking up the aligned data set are identified in the same manner inwhich peak detection was performed prior to normalization of the datatraces. A minimum peak height from the base-line may be selected by theoperator to avoid spurious results. Identified peaks are then used forbase-calling.

Occasionally, peaks may represent a plurality of bands. It is necessaryto determine which peaks these are, and how many bands they represent.An excellent method to employ is gaussian deconvolution whereby a peakis deconvolved into one or more standard gaussian peaks representingsingleton peaks. It is found that peaks generated from DNA sequencingreactions using T7 polymerase (Pharmacia, Sweden) and Thermo Sequenase™(Amersham Life Sciences) generate the most consistent gaussian peaks.

The standard gaussian peak is determined as shown in FIG. 9. Peaks arelocated in a conditioned data trace from one channel. A line is drawnbetween peak points. The point on the line halfway between peaks isjoined to the data trace by a line L perpendicular to the baseline. Thearea under the curve A and between the two perpendicular lines (L(x),L(x+1)) is determined. Height (h) is measured from the baseline to thepeak. h and A are used to calculate sigma (σ) according to the equation:$\sigma = \frac{A}{h\sqrt{\pi}}$

where σ represents the distance on the x-axis between the peak and thepoint at which the value of the gaussian function

y=e ^(−(x/σ)2)

equal 1/e.

For each detected peak, σ is determined. For those peaks where thelength of both L(x) and L(x+1) are greater than ½ h, a linear regressionis performed on the value of σ. Statistically, at most about 25% ofpeaks are expected to represent doubletons, triples or greater, so usinga second linear regression to correct for a width trend over eachwindow, the narrowest 50% of the peaks are selected for use inconstructing a piecewise cubic “singleton width discriminant function”that specifies a model width of a singleton at each location in the dataarray. All peaks in the window that are narrower or equal to thisfunction are deemed to be singletons to a first approximation. Thisapproximation may be further refined using constraints such as peakarea, etc. (FIG. 10) In this way, a standard or model singleton gaussianpeak height and width at any point on the data trace may be defined.

The characteristics of the standard gaussian peak(s) and the positionsof the singletons found via the discriminant function in conjunctionwith the base separation function are then used to classify all thepeaks in the aligned data traces. The features (e.g., height, widthand/or area) of the standard peak are compared to the features of adetected signal peak to determine the number of bases represented.

For example if the standard spacing indicated by the base separationfunction is consistent with there being three peaks in a region betweena pair of singletons and that region is occupied by a large peak, thecharacteristics of the standard gaussian peak (area, height etc.) areused to determine whether two, three or four peaks are most likely to bethe number of base pairs represented by the large peak. Thus, forexample, the difference between the area of the large peak and the areaof the standard gaussian peak can be evaluated. If the area of the largepeak is approximately three times the area of the standard peak, thenthe large peak is treated as representing three bases. Similarly, if thearea is closer to four times the area of the standard peak, the largepeak is treated as representing four bases. A similar, although lesssensitive analysis can be performed based on the height of the peaks.The preferred analysis takes multiple peak features into account.

The primer peak may be eliminated or ignored as described for preparingthe normalized signal output, so as not to interfere with base-calling.

Another method of identifying and eliminating the primer peak uses apeak counting method. The data stream is divided into windows of acertain number of data points. The peaks in each window are counted.When a primer peak is in the window, a window that normally wouldinclude 10 peaks, may have only 2 peaks. This window is eliminated fromconsideration, and other windows are used for alignment andbase-calling.

Once the individual peaks are identified and the multiple peak curvesare divided into individual gaussian peaks, the data may be base-called.Each peak is identified with one channel, representing a single base.Peaks are therefore assigned to specific bases, in sequential order,until the full sequence is identified.

FIG. 11 outlines a second approach to base-calling using the aligneddata set of the invention. In this case, peaks are identified within thealigned data set, for example as discussed above, and the data set isthen divided into segments or windows. Within each window, a subset ofthe peaks are selected, for example 45% of the peaks, using selectioncriteria which select those peaks which are most likely to be “singletonpeaks,” i.e., peaks which represent only a single base and where thatbase is not the same as either adjacent base such that there should beno compression with adjacent peaks.

The selection of singleton peaks follows a several step process in whichthe features of each peak, i.e., width, height and area etc. areextracted and plotted in n-space. Based on the distribution of the peaksand the fact that, statistically, 56% of the bases in any sequenceshould be singletons within this definition, the region of n-space mostlikely to contain singleton peaks is defined. The peaks which fallwithin this region of space are then collected.

From the first pass singleton classification, piecewise singleton heightand width functions that estimate model singleton features are definedfor each segment or window. These functions are then used to reclassifythe peaks from the aligned data set as singleton or non-singleton peaks.From this refined singleton classification, a single piecewise singlebase separation distance function that estimates base-to-base separationis determined and used to provide postulated positions for all bases.These positions, combined with the actual peak features, neighboringcharacteristics and raw signal characteristics are then used to providea best estimate of base call, together with a measure of the certaintyof that call at each postulated base positions.

The method of the invention is advantageously practiced using adedicated apparatus for determining nucleic acid sequences. As depictedin FIG. 12, such an apparatus comprises a sequencer 90 having anelectrophoresis gel holder 901 disposed between electrodes 902 which areused to apply an electric field to a gel placed in the holder to causeoligonucleotide fragments to migrate within the gel; a detection systemcomprising a source 908 for an interrogating beam 92 and a detector 907for detecting the passage of oligonucleotide fragments through adetection zone, for example by monitoring emitted light 99; and a dataprocessing system 96 operatively connected to the detector 95 forreceiving raw data traces for each of the four chain termination productmixtures for a sample. Suitable gel holders, electrodes and detectionsystems are disclosed in U.S. patent application No. 08/353,932, PCTpatent application No. PCT/US95/15951, and U.S. patent application Ser.No. U.S. patent application Ser. No. 08/387,272, all of which areincorporated herein by reference, although it will be understood thatthe particular configuration of the electrophoresis and detection systemis not critical to the present invention.

The data processing system is suitably a personal or mini-computer whichhas stored therein a programmed instruction set effective to

identify peaks in each of the four data traces;

normalize the height of said peaks in each of said data traces to acommon value to generate four normalized data traces;

combine the four normalized data traces in an initial alignment;

determine coefficients of shift and stretch for selected data pointswithin each normalized data trace, said coefficients optimizing a costfunction which reflects the extent of overlap of peaks in combinednormalized data traces to which the coefficients have been applied, saidcost function being optimized when the extent of overlap is at aminimum;

generate a warp function for each normalized data trace from thecoefficients of shift and stretch determined for the selected datapoints;

apply each warp functions to the respective data trace or the normalizeddata trace to produce four warped data traces; and

align the four warped data traces to form an aligned data set. The dataprocessing system may be connected to a video display 97 for displayingthe aligned data set.

In a preferred embodiment of the invention, the apparatus of theinvention reports confidence levels to the system operator for some orall of the bases identified in the sequence. The confidence leveladvantageously reflects both (1) the arithmetic agreement between thesignal and the model, and (2) other features of the data signal (forexample expanded peak width) which may indicate reasons that theconfidence level should be lower than the apparent level based onarithmetic agreements. This confidence level can be reported for allpeaks, or it can be reported only for those peaks for which theconfidence level falls below a selected threshold value. Peaks may alsobe flagged during the reporting process to report ambiguities in theidentification of the number of bases represented by a multiple peakfeature.

What is claimed is:
 1. A method for aligning data traces from fourchannels of an automated electrophoresis detection apparatus, eachchannel detecting the products of one of four chain-termination DNAsequencing reactions, whereby said four channels together provideinformation concerning the position of all four bases within a nucleicacid polymer being analyzed, comprising the steps of: (a) identifyingpeaks in each of the four data traces; (b) combining the four datatraces in an initial alignment; (c) determining coefficients of shiftand stretch for selected data points within each data trace, saidcoefficients optimizing a cost function which reflects the suitabilityof a trial alignment resulting from the application of the trial stretchand shift coefficients to the combined data traces; (d) generating awarp function for each data trace from the coefficients of shift andstretch determined for the selected data points; (f) applying each warpfunctions to the respective data trace to produce four warped datatraces; and (g) assembling the four warped data traces to form analigned data set.
 2. The method of claim 1, further comprising the stepof normalizing the data traces to generate four normalized data tracesin which homozygous peaks are all of substantially equal height prior tocombining the four data traces into an initial alignment.
 3. The methodof claim 1, wherein the cost function reflects the extent of overlap ofpeaks in the combined data traces to which the coefficients have beenapplied.
 4. The method of claim 1, wherein the cost function determinesthe area of a region above the combined data traces and below the commonvalue, and wherein the cost function is optimized when this area is at aminimum.
 5. The method of claim 1, wherein the cost function determinesthe area of a region below the combined data traces, and wherein thecost function is optimized when this area is at a maximum.
 6. The methodof claim 1, wherein the cost function determines the area of a firstregion above the combined data traces and the area of a second region,said second region being below t he highest-edge of the combined datatraces and above the second highest edge of the combined data traces,wherein the cost function is optimized when the first area is minimizedand the second area is maximized.
 7. The method of claim 1, wherein thecoefficients of shift and stretch for selected data points within eachdata trace that yield the optimum value of the cost function aredetermined using a process of simulated annealing.
 8. A method fordetermining the sequence of a nucleic acid polymer comprising the stepsof: (a) obtaining data traces from four channels of an automatedelectrophoresis detection apparatus, each channel detecting the productsof one of four chain-termination DNA sequencing reactions, whereby saidfour channels together provide information concerning the position ofall four bases within a nucleic acid polymer being analyzed; (b)aligning the data traces by a method according to claim 1 to produce analigned data set; (c) evaluating the aligned data set to determine thesequence of bases within the nucleic acid polymer.
 9. The methodaccording to claim 8, further comprising the steps of determiningstandard gaussian peak shapes for data points along the aligned dataset; assigning peaks in the aligned data set as singleton peaks ormultiple peaks by comparison of the peaks in the aligned data set to thestandard gaussian peak characteristics; and determining how manystandard gaussian peaks are contained within each multiple peak, whereineach multiple peak is treated as that number of singleton peaks forpurposes of base-calling.
 10. The method according to claim 8, furthercomprising the step of reporting confidence levels for at least ofportion of the called base peaks.
 11. The method according to claim 10,wherein the confidence level reflects the arithmetic agreement betweenthe peak in the data trace and a model, and those features of the peakwhich might justify assignment of a lower confidence level.
 12. Anapparatus for determining the sequence of a nucleic acid polymercomprising: (a) an electrophoresis gel holder; (b) first and secondelectrodes disposed to apply an electric field to the electrophoresisgel disposed within the electrophoresis gel holder to causeoligonucleotide fragments loaded on the electrophoresis gel to migratewithin the electrophoresis gel; (c) a detection system comprising aninterrogating beam and a detector for detecting the passage ofoligonucleotide fragments through a detection zone; and (d) a dataprocessing system operatively connected to the detector for receivingfour data traces, one for each of four chain termination productmixtures for the nucleic acid polymer, wherein the data processingsystem has stored therein a programmed instruction set effective toidentify peaks in each of the four data traces; combine the four datatraces in an initial alignment; determine coefficients of shift andstretch for selected data points within each data trace, saidcoefficients optimizing a cost function which reflects the suitabilityof a trial alignment resulting from the application of the trial stretchand shift coefficients to the combined data traces; generate a warpfunction for each data trace from the coefficients of shift and stretchdetermined for the selected data points; apply each warp functions tothe respective data trace to produce four warped data traces; andassemble the four warped data traces to form an aligned data set. 13.The apparatus according to claim 12, further comprising a video displayfor displaying the aligned data set.